Week 4: High-dimensional visualisation and analytics in R
The University of Sydney
Important

R functions used this week:

- hclust (hierarchical clustering)
- kmeans (K-means clustering)
- dist (distance matrix)
- Rtsne::Rtsne (t-SNE)
- cmdscale (Multi-Dimensional Scaling)

This presentation is based on the SOLES reveal.js Quarto template and is licensed under a Creative Commons Attribution 4.0 International License.
Supervised learning: Given a dataset with features/predictors (X_{i1}, X_{i2}, \ldots, X_{ip}) and a target outcome variable (Y_i), the goal is to learn a model f(X_{i1}, \ldots, X_{ip}) to explain/predict the target.

Example: What is the relationship between BuildingArea and Price of houses in St Kilda?

Unsupervised learning: Given a dataset with features (X_{i1}, X_{i2}, \ldots, X_{ip}), the goal is to visualise the data and find patterns or subgroups/clusters within it.
Clustering Goals
Compact Clusters: Observations within a cluster should be close together.
Well-Separated Clusters: Observations from different clusters should be far apart.
Example algorithms: K-means clustering (kmeans) and hierarchical clustering (hclust).
\begin{align*} \text{WSS}_k = \frac{1}{|C_k|} \sum_{i,j \in C_k} || x_i - x_j||^2, \quad k=1,\ldots,K \end{align*}
This is the sum of all pairwise squared Euclidean distances between observations in the k^{\rm th} cluster, divided by the total number of observations in that cluster, where:

- \sum_{i,j \in C_k} || x_i - x_j||^2 is the sum of all pairwise squared Euclidean distances between observations in the k^{\rm th} cluster.
- |C_k| is the total number of observations in the k^{\rm th} cluster.
The goal is to minimise the total within-cluster sum of squares criterion \begin{align*} \text{WSS}_{\text{Total}} = \sum_{k = 1}^K \text{WSS}_k \end{align*}

The total within-cluster sum of squares will always decrease as K increases, so it cannot simply be minimised over K to choose the number of clusters.
Rule of thumb: Look for the elbow
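The elbow heuristic can be sketched in base R with the kmeans function; the use of the built-in USArrests data and the range K = 1..10 are illustrative choices, not from the original slides:

```r
# Total WSS for K = 1..10 on the scaled USArrests data;
# kmeans() reports it as tot.withinss.
dat <- scale(USArrests)
set.seed(1)
wss <- sapply(1:10, function(k) kmeans(dat, centers = k, nstart = 20)$tot.withinss)

# Plot total WSS against K and look for the "elbow",
# the point where adding clusters stops helping much.
plot(1:10, wss, type = "b", xlab = "Number of clusters K", ylab = "Total WSS")
```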
Initialise each observation at random to a cluster.

Iterate the following until the assignments stop changing:

1. For each cluster, compute the cluster centroid, i.e., the mean of the observations currently assigned to that cluster.
2. Reassign each observation to the cluster whose centroid is closest in Euclidean distance.
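In R, this algorithm is implemented by the built-in kmeans function. A minimal sketch, where the choice of USArrests and K = 4 is for illustration only:

```r
set.seed(1)
# nstart = 20 repeats the random initialisation 20 times and keeps
# the run with the lowest total within-cluster sum of squares.
km <- kmeans(scale(USArrests), centers = 4, nstart = 20)

table(km$cluster)  # cluster sizes
km$centers         # cluster centroids in the scaled feature space
```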
Suppose we have a data matrix X with n observations and p features
Naively, we can plot all pairwise combinations of features, i.e., 1 vs 2, 1 vs 3, \ldots, (p - 1) vs p, giving p(p - 1)/2 scatterplots, which quickly becomes impractical as p grows.
Principal Component Analysis (PCA) helps us reorganise/transform the data into a new coordinate system.
Start with a data matrix \boldsymbol{X}, and assume it has mean zero \begin{align*} \boldsymbol{X} = \begin{pmatrix} X_1, X_2 , \ldots , X_p \end{pmatrix} \end{align*} The first principal component is the normalised linear combination of the features that maximises the variance in the new component \begin{align*} Z_1 = \phi_{11}X_{1} + \phi_{21}X_2 + \cdots + \phi_{p1}X_p = \sum_{j = 1}^p \phi_{j1}X_j = \boldsymbol{X}\boldsymbol{\phi}_1, \quad \boldsymbol{\phi}_1 = (\phi_{11},\ldots,\phi_{p1})^T \end{align*} The elements \phi_{j1} are known as the loadings of the first principal component
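In R, the loadings \phi_{j1} can be read off the rotation matrix returned by prcomp; a sketch on the USArrests data used later in these slides:

```r
# prcomp centres the data; scale. = TRUE also standardises each feature
pca <- prcomp(USArrests, scale. = TRUE)

pca$rotation[, 1]  # loadings of the first principal component
summary(pca)       # proportion of variance explained by each component
```

By construction the loading vector has unit length (the squared loadings sum to one), which is the normalisation constraint in the definition above.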
Variables may have different units
This leads to different variances, affecting PCA results.
Murder, Rape and Assault are reported as the number of cases per 100,000 people, while UrbanPop is measured in percentage terms.

The variances of the four features are 18.97 (Murder), 87.73 (Rape), 6945.16 (Assault) and 209.5 (UrbanPop).
library(tidyverse)
library(gridExtra)

data <- USArrests
# Variance of each (unscaled) feature
var.dat <- apply(data, 2, var)
var.dat <- var.dat %>% t %>% as.data.frame %>% gather

# PCA on the standardised data
usarrest.pca <- prcomp(data, scale. = TRUE)
# Proportion of variance explained by each of the four components
pca.dat <- usarrest.pca$sdev^2 / 4

ggplot(data.frame(variance = pca.dat * 100, Component = 1:4)) +
  geom_col(aes(x = Component, y = variance)) +
  theme_minimal() +
  labs(y = "Variance explained (%)", x = "Principal Component")

t-SNE helps visualize high-dimensional data by preserving local structures in a lower-dimensional space.
Mathematical definition:
\begin{align*}
p_{j|i}(\sigma^2_i) = \frac{\phi(x_j; x_i, \sigma_i^2)}{\sum_{k \neq i}\phi(x_k; x_i, \sigma_i^2)}
\end{align*}
Mathematical definition:
\begin{align*}
q_{ij} = \frac{(1 + ||y_i - y_j||_2^2)^{-1}}{\sum_{k \neq l} (1 + ||y_k - y_l||_2^2)^{-1}}
\end{align*}

The above formula comes from the probability density function of the t-distribution with one degree of freedom (i.e., the Cauchy distribution), where:

- y_1, \ldots, y_n are the low-dimensional (e.g., two-dimensional) representations of the observations x_1, \ldots, x_n.
MNIST Example
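A minimal sketch of running t-SNE with the Rtsne package; since loading MNIST is dataset-specific, the built-in iris data is used here as a stand-in:

```r
library(Rtsne)  # assumed installed from CRAN

# t-SNE requires unique rows; iris contains a duplicated observation
X <- as.matrix(unique(iris[, 1:4]))
set.seed(1)
fit <- Rtsne(X, dims = 2, perplexity = 30)

# Colour points by species to see whether local structure is preserved
plot(fit$Y, col = iris$Species[!duplicated(iris[, 1:4])],
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```

Note that perplexity (roughly, the effective number of neighbours each point considers) is the main tuning parameter; different values can give quite different maps.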
library(kableExtra)  # for kbl() and kable_paper()

cities <- c("Adelaide", "Alice", "Brisbane", "Cairns", "Canberra", "Darwin", "Melbourne", "Perth", "Sydney")
city.dist <- matrix(c(0, 1533, 2044, 3143, 1204, 3042, 728, 2725, 1427,
1533, 0, 3100, 2500, 2680, 1489, 2270, 3630, 2850,
2044, 3100, 0, 1718, 1268, 3415, 1669, 4384, 1010,
3143, 2500, 1718, 0, 2922, 3100, 3387, 5954, 2730,
1204, 2680, 1268, 2922, 0, 3917, 647, 3911, 288,
3042, 1489, 3415, 3100, 3917, 0, 4045, 4250, 3991,
728, 2270, 1669, 3387, 647, 4045, 0, 3430, 963,
2725, 3630, 4384, 5954, 3911, 4250, 3430, 0, 4110,
1427, 2850, 1010, 2730, 288, 3991, 963, 4110, 0),
nrow = length(cities), ncol = length(cities))
colnames(city.dist) <- rownames(city.dist) <- cities
city.dist %>% kbl %>% kable_paper("hover", full_width = TRUE)

|  | Adelaide | Alice | Brisbane | Cairns | Canberra | Darwin | Melbourne | Perth | Sydney |
|---|---|---|---|---|---|---|---|---|---|
| Adelaide | 0 | 1533 | 2044 | 3143 | 1204 | 3042 | 728 | 2725 | 1427 |
| Alice | 1533 | 0 | 3100 | 2500 | 2680 | 1489 | 2270 | 3630 | 2850 |
| Brisbane | 2044 | 3100 | 0 | 1718 | 1268 | 3415 | 1669 | 4384 | 1010 |
| Cairns | 3143 | 2500 | 1718 | 0 | 2922 | 3100 | 3387 | 5954 | 2730 |
| Canberra | 1204 | 2680 | 1268 | 2922 | 0 | 3917 | 647 | 3911 | 288 |
| Darwin | 3042 | 1489 | 3415 | 3100 | 3917 | 0 | 4045 | 4250 | 3991 |
| Melbourne | 728 | 2270 | 1669 | 3387 | 647 | 4045 | 0 | 3430 | 963 |
| Perth | 2725 | 3630 | 4384 | 5954 | 3911 | 4250 | 3430 | 0 | 4110 |
| Sydney | 1427 | 2850 | 1010 | 2730 | 288 | 3991 | 963 | 4110 | 0 |
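Classical MDS on a distance matrix like the one above is done with base R's cmdscale; a self-contained sketch using the built-in eurodist road-distance matrix (standing in for the Australian distances):

```r
# Classical (metric) MDS: recover 2-D coordinates whose pairwise
# distances approximate the input road distances as closely as possible.
mds <- cmdscale(eurodist, k = 2)

# Plot the recovered "map", labelling each point with its city name
plot(mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds, labels = rownames(mds), cex = 0.8)
```

The recovered coordinates are only identified up to rotation and reflection, so the map may appear flipped relative to a geographic map; only the relative distances are meaningful.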
Comments on MDS
Benefits & Drawbacks of MDS
How to Interpret MDS Maps